4  Importing Data into R - Practical

In this practical, I'll walk through a complete data import process. This draws on what you learned in Week One as well as the pre-class reading for this week.

4.1 Part One: Working with CSV Data (1)

Step One: Creation of a Synthetic Dataset

Demonstration

  • Run the following code. As you run each line, think through what the code is doing.

  • Note how useful (or not) the comments are in understanding what the code is doing.

  • Note what happens in your Environment window when you run each line.

# Set seed for reproducibility
set.seed(42)

# Create synthetic data for an ice hockey dataset
player_id <- paste("Player", 1:100)  # Player id
team_names <- sample(c("Team A", "Team B", "Team C", "Team D"), 100, replace = TRUE)  # Teams
goals_scored <- sample(0:10, 100, replace = TRUE)  # Goals scored by each player
assists <- sample(0:15, 100, replace = TRUE)  # Assists by each player
penalty_minutes <- sample(0:20, 100, replace = TRUE)  # Penalty minutes

# Combine into a data frame
hockey_data <- data.frame(Player = player_id,
                          Team = team_names,
                          Goals = goals_scored,
                          Assists = assists,
                          Penalty_Minutes = penalty_minutes)

# Display the first few rows of the data
head(hockey_data)
    Player   Team Goals Assists Penalty_Minutes
1 Player 1 Team A     1       8               8
2 Player 2 Team A     7       0              10
3 Player 3 Team A     0      13               5
4 Player 4 Team A     1      12               9
5 Player 5 Team B     4      14               8
6 Player 6 Team D     7      13              18
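
Besides head(), a few base R functions report the same information that the Environment window shows. The sketch below rebuilds a smaller version of the data frame (10 players rather than 100, and only three columns, purely to keep the example short) and inspects it. The name hockey_small is made up for this illustration.

```r
# Rebuild a small version of the hockey data (10 rows, for illustration only)
set.seed(42)
hockey_small <- data.frame(
  Player = paste("Player", 1:10),
  Team = sample(c("Team A", "Team B", "Team C", "Team D"), 10, replace = TRUE),
  Goals = sample(0:10, 10, replace = TRUE)
)

dim(hockey_small)    # number of rows and columns
names(hockey_small)  # column names
str(hockey_small)    # type and first few values of each column
```

Compare what str() prints with the entry for the data frame in your Environment window.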

Practice

Now, create a new dataset called [hockey_data_02]. Change the team names, the range of goals scored, and at least one of the variable names.
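
If you get stuck, here is one possible starting point. The team names, the 0 to 25 goal range, and the renamed column Club are arbitrary choices made for this sketch, so pick your own.

```r
set.seed(42)

# Hypothetical team names and value ranges, chosen only for illustration
hockey_data_02 <- data.frame(
  Player = paste("Player", 1:100),
  Club = sample(c("Bears", "Wolves", "Hawks"), 100, replace = TRUE),  # renamed from Team
  Goals = sample(0:25, 100, replace = TRUE),                          # wider goal range
  Assists = sample(0:15, 100, replace = TRUE),
  Penalty_Minutes = sample(0:20, 100, replace = TRUE)
)

# Display the first few rows
head(hockey_data_02)
```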

Step Two: Saving the Data

Demonstration

In the following code, I save my dataframe ‘hockey_data’ as a .csv file. Notice where this file is stored (look in the ‘Files’ window).

# Save the dataset as a CSV file
write.csv(hockey_data, file = "hockey_data.csv", row.names = FALSE)
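
A file name without a folder, such as "hockey_data.csv", is written to the current working directory. The sketch below shows how to check that directory and confirm that a file exists. It uses a small stand-in data frame and a temporary folder (both chosen only for this example) so that it leaves nothing behind in your project.

```r
# A small stand-in data frame (hypothetical values)
example_data <- data.frame(Player = c("Player 1", "Player 2"), Goals = c(3, 5))

# A relative file name is resolved against the working directory
getwd()

# Write to a temporary folder here so the example leaves no files behind
csv_path <- file.path(tempdir(), "example_data.csv")
write.csv(example_data, file = csv_path, row.names = FALSE)

# TRUE if the file was written
file.exists(csv_path)
```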

Practice

Repeat the above, saving your dataframe ‘hockey_data_02’ as a .csv file.

Now, save the same dataframe to your University OneDrive folder.
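
To save somewhere other than the working directory, give write.csv() a full path. The sketch below is a minimal illustration: onedrive_dir is a hypothetical folder inside R's temporary directory that stands in for your real OneDrive location, which will differ on your machine.

```r
# A small stand-in data frame (hypothetical values)
example_data <- data.frame(Player = c("Player 1", "Player 2"), Goals = c(3, 5))

# Hypothetical folder standing in for your OneDrive location;
# replace it with the real path on your machine
onedrive_dir <- file.path(tempdir(), "OneDrive_example")
dir.create(onedrive_dir, showWarnings = FALSE)

# file.path() joins the folder and file name with the correct separator
out_file <- file.path(onedrive_dir, "hockey_data_02.csv")
write.csv(example_data, file = out_file, row.names = FALSE)

# List the folder contents to confirm the file is there
list.files(onedrive_dir)
```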

Step Three: Importing the Data

Demonstration

First, I am going to clear my environment. You can do this with the brush tool, or use:

rm(list = ls())  # This clears the environment

# Import the dataset back into R
imported_data <- read.csv("hockey_data.csv")

# Display the imported data
head(imported_data)
    Player   Team Goals Assists Penalty_Minutes
1 Player 1 Team A     1       8               8
2 Player 2 Team A     7       0              10
3 Player 3 Team A     0      13               5
4 Player 4 Team A     1      12               9
5 Player 5 Team B     4      14               8
6 Player 6 Team D     7      13              18
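
Comparing the first six rows by eye is a reasonable first check, but you can also ask R whether the imported data matches what was saved. The sketch below does a small round trip under assumed conditions: a 20-row stand-in data frame and a temporary file, with the names original and imported made up for the example.

```r
set.seed(42)
original <- data.frame(
  Player = paste("Player", 1:20),
  Goals = sample(0:10, 20, replace = TRUE)
)

# Save, then read back in
csv_path <- file.path(tempdir(), "round_trip.csv")
write.csv(original, file = csv_path, row.names = FALSE)
imported <- read.csv(csv_path)

# Same shape and same column names?
identical(dim(original), dim(imported))
identical(names(original), names(imported))

# Same values, tolerating small differences in how numbers are stored
all.equal(original, imported)
```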

Practice

Repeat these steps for the ‘hockey_data_02’ file you saved earlier.

4.2 Part Two: Working with CSV Data (2)

Now, we’ll move on to explore the differences between importing and exporting CSV files in R with and without row names.

We’ll start by creating a synthetic dataset based on netball player statistics, and then demonstrate how to save this dataset to a CSV file both with and without row names.

Finally, we’ll show how to import these CSV files back into R.

Step 1: Create a Synthetic Dataset

Let’s begin by creating a synthetic dataset that contains information about netball players. This dataset will include columns for [Player], [Position], [Goals], and [Assists].

# Create a synthetic netball dataset
netball_data <- data.frame(
  Player = c("Alice", "Bella", "Catherine", "Diana", "Emily"),
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper", "Centre", "Goal Defense"),
  Goals = c(45, 30, 0, 15, 5),
  Assists = c(10, 25, 5, 20, 10)
)

# View the dataset
print(netball_data)
     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10

Step 2: Exporting the CSV File

Now, we’ll export this dataset to a CSV file. We’ll do this twice: once with row names and once without.

# Export with row names
write.csv(netball_data, "netball_with_rownames.csv", row.names = TRUE)

Check your working directory. You should see a CSV file named [netball_with_rownames.csv].

# Export without row names
write.csv(netball_data, "netball_without_rownames.csv", row.names = FALSE)

This will create a CSV file named [netball_without_rownames.csv] that does not contain the row numbers as an extra first column.
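
To see exactly what each export wrote, you can read the files back as plain text. The sketch below uses a two-row stand-in dataset and temporary files (assumptions made to keep the example self-contained) rather than the files created above.

```r
netball_small <- data.frame(
  Player = c("Alice", "Bella"),
  Goals = c(45, 30)
)

with_path <- file.path(tempdir(), "with_rownames.csv")
without_path <- file.path(tempdir(), "without_rownames.csv")
write.csv(netball_small, with_path, row.names = TRUE)
write.csv(netball_small, without_path, row.names = FALSE)

# readLines() returns the file as plain text, one element per line
readLines(with_path)
readLines(without_path)
```

The file written with row.names = TRUE has an extra first column with an empty header that holds the row names.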

Step 3: Importing the CSV Files

Next, we’ll import these CSV files back into R to observe the differences.

# Import with row names
netball_with_rownames <- read.csv("netball_with_rownames.csv", row.names = 1)
print(netball_with_rownames)
     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10

Notice that, by specifying row.names = 1, we tell R to use the first column as row names. The imported data will appear similar to the original dataframe.
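
It helps to see what happens if you leave row.names = 1 out when the file does contain row names. In the sketch below (a two-row stand-in dataset and a temporary file, both chosen for illustration), the row names come back as an ordinary column, which read.csv() names X.

```r
netball_small <- data.frame(
  Player = c("Alice", "Bella"),
  Goals = c(45, 30)
)

csv_path <- file.path(tempdir(), "netball_rownames_demo.csv")
write.csv(netball_small, csv_path, row.names = TRUE)

# Without row.names = 1: the saved row names become a column called X
no_arg <- read.csv(csv_path)
names(no_arg)

# With row.names = 1: the first column is used as row names instead
with_arg <- read.csv(csv_path, row.names = 1)
names(with_arg)
```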

# Import without row names
netball_without_rownames <- read.csv("netball_without_rownames.csv")
print(netball_without_rownames)
     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10

Since we exported this file without row names, we don’t need to specify the row.names argument. The data is imported as is, and R automatically assigns default row numbers starting from 1.

Step 4: Comparing the Datasets

Let’s compare the datasets to see the difference:

# Compare the datasets
print("Dataset with Row Names:")
[1] "Dataset with Row Names:"
print(netball_with_rownames)
     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10
print("Dataset without Row Names:")
[1] "Dataset without Row Names:"
print(netball_without_rownames)
     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10

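In the first dataset, the row names were read from the first column of the exported file, while in the second, R has automatically assigned new row numbers starting from 1. Because the original row names were simply the numbers 1 to 5, the two printed datasets look identical here.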
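
Rather than comparing printed output by eye, you can compare the row names directly. This is a minimal sketch using a three-row stand-in dataset and temporary files; the object names with_rn and without_rn are made up for the example.

```r
netball_small <- data.frame(
  Player = c("Alice", "Bella", "Catherine"),
  Goals = c(45, 30, 0)
)

with_path <- file.path(tempdir(), "compare_with.csv")
without_path <- file.path(tempdir(), "compare_without.csv")
write.csv(netball_small, with_path, row.names = TRUE)
write.csv(netball_small, without_path, row.names = FALSE)

with_rn <- read.csv(with_path, row.names = 1)
without_rn <- read.csv(without_path)

# rownames() returns the row names as character strings
rownames(with_rn)
rownames(without_rn)

# Do both imports end up with the same row names?
identical(rownames(with_rn), rownames(without_rn))
```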

Step 5: Adding Missing Row Names

If your data doesn’t have meaningful row names, you might want to add them. Row names are most useful when they identify each row uniquely, for example player names or ID codes, because you can then refer to a row by name rather than by position.

# Create a synthetic netball dataset
netball_data <- data.frame(
  Player = c("Alice", "Bella", "Catherine", "Diana", "Emily"),
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper", "Centre", "Goal Defense"),
  Goals = c(45, 30, 0, 15, 5),
  Assists = c(10, 25, 5, 20, 10)
)

# Assign player names as row names
row.names(netball_data) <- netball_data$Player

# Remove the Player column
netball_data <- netball_data[ , -1]

# View the updated dataset
print(netball_data)
              Position Goals Assists
Alice     Goal Shooter    45      10
Bella      Wing Attack    30      25
Catherine  Goal Keeper     0       5
Diana           Centre    15      20
Emily     Goal Defense     5      10

# Export with row names
write.csv(netball_data, "netball_with_custom_rownames.csv", row.names = TRUE)
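
When a file with custom row names is imported again with row.names = 1, the names can be used to look up rows. The sketch below shows this with a two-row stand-in dataset and a temporary file (assumptions for illustration, not the file written above).

```r
netball_small <- data.frame(
  Position = c("Goal Shooter", "Wing Attack"),
  Goals = c(45, 30),
  row.names = c("Alice", "Bella")
)

csv_path <- file.path(tempdir(), "custom_rownames.csv")
write.csv(netball_small, csv_path, row.names = TRUE)

# Read the first column back in as row names
reimported <- read.csv(csv_path, row.names = 1)
rownames(reimported)

# Look up a value by row name and column name
reimported["Alice", "Goals"]
```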

Setting Variable Types

  • Examine an imported dataset (which functions can you use?)

  • Are all the variables interpreted correctly when R imports the CSV file?

  • If not, what do you need to do to correct this?
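
As a hint for the questions above, here is one way to inspect and change variable types. It is a sketch under assumptions: a three-row stand-in dataset, a temporary file, and the decision to treat Position as a categorical variable (a factor), which may or may not suit your own analysis.

```r
netball_small <- data.frame(
  Player = c("Alice", "Bella", "Catherine"),
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper"),
  Goals = c(45, 30, 0)
)

csv_path <- file.path(tempdir(), "types_demo.csv")
write.csv(netball_small, csv_path, row.names = FALSE)
imported <- read.csv(csv_path)

# Inspect how R has interpreted each column
str(imported)
sapply(imported, class)

# Convert Position to a factor (categorical variable)
imported$Position <- as.factor(imported$Position)
levels(imported$Position)
```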

4.3 Practical Activity

  1. Using the code provided above, create your own synthetic dataset.
  2. Save the dataset as a .csv file.
  3. Clear your environment, then import your dataset back into the environment as a dataframe.